December 2021
This project was built during my studies in the MSc Statistics for the module "Managing and Visualizing Data".
It was a group project among 3 people and the contributions of each are shown at the end.
from IPython.display import Image
Image(filename='car-image.jpg', width =600, height=300)
The objective of this project is to explore and analyse the accidents that happened in the United States in February and March of 2016. The purpose is to gain insight into the circumstances under which an accident is more prone to happen, its severity and some of the consequences that accidents have on US roads. This analysis is intended for the authorities, to help them prevent accidents by understanding where their budget and efforts should focus, both in terms of road infrastructure and in day-to-day decisions based on weather forecasts.
Furthermore, modelling techniques are used to predict the severity of the accident. A game-theoretic approach is used to explain the output of one of the models, and the most important features are revealed. Linear regression is used to test whether there is a correlation between the population of a state and the number of accidents.
The main features that are taken into account and analysed in relation to the accidents and their severities are road features at the point of the accident, day/time, weather conditions, city and state characteristics and traffic of the road of interest.
The datasets used in this project come from the following sources:
To address the stated objectives, our analysis covers characteristics that can also be grouped into the categories below:
1. Setup
2. Data Preparation
2.1 Read US Data
2.2 Initial Data Exploration
2.3 Data Cleansing
3. Data Exploration
3.1 Weather and population data
3.2 Datetime data
3.3 Location data
3.4 Vehicle Ownership data
3.5 Traffic Google Maps data
3.6 Accident Description data
4. Dimensionality Reduction
4.1 City and State variables
4.2 Weather Condition variable
5. Model Evaluation
5.1 Multi-class classification
5.2 Binary classification
5.3 Feature Importance
6. Main findings
7. References
8. Individual Contribution
Import the necessary packages for the project:
import pandas as pd
#from outscraper import ApiClient
from datetime import datetime
import numpy as np
import json
import math
import missingno as msno
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import calendar
from sklearn import (manifold, decomposition)
# libraries for model selection and evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, confusion_matrix, precision_recall_curve, average_precision_score, accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
import shap
# libraries for model fitting
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost
import tabulate
#for visualising interactive maps
import os
import folium
from folium import plugins
import rioxarray as rxr
import earthpy as et
import earthpy.spatial as es
from shapely.geometry import Point
import geopandas as gpd
from geopandas import GeoDataFrame
import plotly.graph_objects as go
import plotly.express as px
#for text analysis
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from matplotlib.colors import LinearSegmentedColormap
import colorama
from colorama import Fore
import yaml #for printing a dictionary in a clean way
from IPython.core.display import HTML as Center #to align all the plots in this notebook in the center
from adjustText import adjust_text
from xgboost import XGBClassifier
Suppress code warnings to make the notebook cleaner:
import warnings
warnings.filterwarnings('ignore')
Expand the output in the notebook, so that the user doesn't need to scroll:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
return false;
}
Code to align all the plots in this notebook in the center
Center(""" <style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style> """)
Read the US Accidents dataset using the csv file found on the Kaggle website [1]. The csv file was first loaded locally.
Note: if you need to run the code below, please download the csv file from Kaggle, as it was too large to push to GitHub.
The file has 3 datetime variables, which are parsed with the corresponding datetime format.
accidents_df = pd.read_csv('US_Accidents_Dec20_updated.csv', parse_dates=['Start_Time', 'End_Time', 'Weather_Timestamp'])
accidents_df.head()
For the purpose of data cleansing, some initial data exploration takes place. This exploration gives direction to the data cleansing needs.
Get a view of all the columns and their types.
accidents_df.info()
From the Start_Time column extract the year, month and hour and create 3 respective columns to use them as additional features.
accidents_df['year'] = pd.DatetimeIndex(accidents_df['Start_Time']).year
accidents_df['month'] = pd.DatetimeIndex(accidents_df['Start_Time']).month
accidents_df['hour'] = pd.DatetimeIndex(accidents_df['Start_Time']).hour
Filter the dataframe to contain only accidents that happened in February or March of 2016.
accidents_2016_df = accidents_df[ (accidents_df['year']==2016) & (accidents_df['month'].isin([2,3]))]
print('The new dataframe has ' + str(accidents_2016_df.shape[0]) + ' rows.')
Rename columns that in the original dataset contain symbols, like parentheses, that would complicate the code.
accidents_2016_df = accidents_2016_df.rename({'Temperature(F)': 'Temperature_F',
'Visibility(mi)': 'Visibility_mi',
'Distance(mi)': 'Distance_mi',
'Wind_Chill(F)': 'Wind_Chill_F',
'Humidity(%)': 'Humidity_prc',
'Pressure(in)': 'Pressure_in',
'Wind_Speed(mph)': 'Wind_Speed_mph',
'Precipitation(in)': 'Precipitation_in'
},
axis='columns')
Web scraping to get the full names of the states from their abbreviations.
accidents_2016_df = accidents_2016_df.rename(columns={'State':'State abbreviation'})
# Change abbreviation to full name of states by web scraping. This is faster than manually inputting the names and abbreviations
res = requests.get("https://abbreviations.yourdictionary.com/articles/state-abbrev.html")
soup = BeautifulSoup(res.content, "html.parser")
states = soup.find('table')
df1 = pd.read_html(str(states))
df1 = pd.DataFrame(df1[0])
df1 = df1.iloc[1:,:2]
df1 = df1.rename(columns={0:'State', 1:'State abbrev'})
accidents_2016_df = accidents_2016_df.merge(df1, how='left', left_on = 'State abbreviation', right_on='State abbrev')
accidents_2016_df.drop(columns = 'State abbrev', inplace = True)
Check whether there are any duplicates. The result below is zero, hence there are no duplicates in the dataset.
accidents_2016_df.duplicated().sum()
Get the number of distinct values for each character column, in order to decide how they will be treated, based on whether their cardinality level is low or high.
#Get a list of the character columns
character_cols = accidents_2016_df.select_dtypes(include=['object']).columns
#Get the number of distinct values for each character column
dic = {}
for c in character_cols:
dic[c] = accidents_2016_df[c].nunique()
#Sort them by ascending order
dic = dict(sorted(dic.items(), key=lambda item: item[1]))
#Print the dictionary nicely
print(yaml.dump(dic, sort_keys=False, default_flow_style=False))
Display frequency bar plots for the categorical variables that were found to have a cardinality of 3 or less.
low_cardinality_cols = ['Side', 'Sunrise_Sunset','Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight','Timezone']
fig, axes = plt.subplots(ncols=3, nrows=2, figsize=(15, 10), sharey=True)
for i, c in enumerate(low_cardinality_cols):
low_card_df = accidents_2016_df.groupby(c)[c].agg('count')
if i <3:
low_card_df.sort_values(ascending=False).head(20).plot(kind='bar', color='b', alpha=0.4, ax=axes[0,i], fontsize =12)
axes[0,i].set_xlabel(c, fontsize =15);
else:
low_card_df.sort_values(ascending=False).head(20).plot(kind='bar', color='b', alpha=0.4, ax=axes[1,i-3], fontsize =12)
axes[1,i-3].set_xlabel(c, fontsize =15);
fig.subplots_adjust(hspace = 0.5); #adjust the plots so that the first row has some space from the second row
plt.title('Frequency bar plots',fontsize=24, y =2.9, x = -0.75);
plt.suptitle('(for categorical columns with low cardinality)', y=0.98, fontsize=15);
#Add separate plot for the country
fig, axes = plt.subplots(ncols=1, nrows=1, figsize=(4.5,4.5))
low_card_df = accidents_2016_df.groupby('Country')['Country'].agg('count')
low_card_df.sort_values(ascending=False).head(20).plot(kind='bar', color='b', alpha=0.4, fontsize =12);
axes.set_xlabel('Country', fontsize =15);
Findings:
From these results we see there are 5 variables (Side, Sunrise_Sunset, Civil_Twilight, Nautical_Twilight, Astronomical_Twilight) with a cardinality of 2. We will transform these into dummy variables so that they become numeric. Transforming a 2-level categorical variable into a dummy variable doesn't increase the complexity of the dataset, and hence of the models that will run on it, as only 1 dummy variable is created. Timezone also has a low cardinality (3), so it will be converted to 3 dummy variables, since for interpretability reasons we do not want to use a reference group for this variable.
Country column has the same value in all records (US), hence we will drop this column, as it doesn't provide any information.
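To illustrate the point about 2-level variables (a standalone sketch on made-up data, not the project's dataframe), `pd.get_dummies` with `drop_first=True` produces a single 0/1 column, with the dropped level acting as the reference group:

```python
import pandas as pd

# Hypothetical two-level categorical, mimicking a column like Side (L/R)
toy = pd.DataFrame({'Side': ['L', 'R', 'R', 'L']})

# drop_first=True keeps one dummy; the dropped level (L) becomes the reference group
encoded = pd.get_dummies(toy, columns=['Side'], drop_first=True, dtype=int)
print(encoded.columns.tolist())      # ['Side_R']
print(encoded['Side_R'].tolist())    # [0, 1, 1, 0]
```

One dummy column carries the same information as the original two-level variable, which is why the encoding adds no redundant features.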
Plot the percentage of missing values for each column that has missing values.
# Create a function that counts the number of missing values and the percentage, stores them in a dataframe
# and sorts it in descending order so that the columns with the most missing values appear first
def calculate_missing_values(input_df):
#Count the number of missing values for each column and store the results in a named panda Series
missings_count = pd.Series(input_df.apply(lambda x: x.isnull().sum()), name ='missings_count')
#Count the percentage of missing values for each column and store the results in a named panda Series
missings_perc = pd.Series(input_df.isnull().sum() * 100 / len(input_df), name ='missings_perc')
column_type = pd.Series(input_df.dtypes, name ='column_type')
    #Concatenate the previous pandas Series into a dataframe
missings_df = pd.concat([column_type, missings_count, missings_perc], axis =1)
    #Sort the dataframe in descending order
missings_df = missings_df.sort_values(by='missings_count', ascending =False)
return missings_df
missings_df= calculate_missing_values(accidents_2016_df)
#Plot percentage of missing values for each column
missings_perc_ser = pd.Series(missings_df['missings_perc'], name ='missings_perc')
missings_df['missings_perc'][missings_perc_ser != 0].plot(kind= 'barh',
xticks = [int(x) for x in np.linspace(0,100,11)],
title = 'Percentage of missing values',
figsize = (9,6));
Observations:
From the bar plot above it is seen that 3 variables (Wind_Chill_F, Number and Precipitation_in) have more than 69% missing values. We consider this a high percentage, hence we will drop these variables in the "Manipulation" part.
It is also seen that 5 weather-related variables (Visibility_mi, Humidity_prc, Temperature_F, Pressure_in and Wind_Speed_mph) have missing values: about 2% for the first four and 12% for Wind_Speed_mph. For these variables we will perform imputation using the time interpolation method, after grouping them by city, so that weather values are interpolated only from the weather of the same city.
Drop the 3 variables that have more than 69% null values, and the Country column that has only 1 distinct value populated in all records.
accidents_clean_temp_df = accidents_2016_df.drop(['Precipitation_in', 'Wind_Chill_F', 'Number', 'Country'], axis = 1)
Impute the missing records of the weather-related variables with time interpolation in both directions, grouping the records by city, so that interpolation happens only within the same city. The reason is that the weather is likely to be similar across nearby days within the same city.
#Get all the city names that appear in the dataset, into one array
cities = accidents_2016_df['City'].unique()
num_cols_for_imputation = ["Wind_Speed_mph" ,"Visibility_mi" ,"Humidity_prc" ,"Temperature_F" ,"Pressure_in"]
imputed_city_df = accidents_2016_df[['ID', 'Start_Time', 'City'] + num_cols_for_imputation]
#Store the initial dataframe in another object, which will be updated iteratively below
initial_df = imputed_city_df
#Loop through all the city names:
for c in cities:
    imputed_city_df = accidents_2016_df[['ID', 'Start_Time', 'City'] + num_cols_for_imputation]
#Create a dataframe that contains only records from a particular city, which will be used to interpolate the missing values
imputed_city_df = imputed_city_df [imputed_city_df['City'] == c]
#Make the current index of the dataframe a column and name it as 'The_org_index'.
#This is because later on we need to change the index of the column and then revert it back to the initial one.
imputed_city_df['The_org_index'] = imputed_city_df.index
#Set the index of the dataframe to be the 'Start_Time', because we want the interpolation to use it as an index.
imputed_city_df = imputed_city_df.set_index(['Start_Time'])
#Interpolate the missing values in both directions
imputed_city_df = imputed_city_df.interpolate(method='time', limit_direction ='both')
#Set the index of the dataframe to the original index
imputed_city_df = imputed_city_df.set_index(['The_org_index'])
for index, row in imputed_city_df.iterrows():
id_new = imputed_city_df.loc[index, 'ID']
for col in num_cols_for_imputation:
temp_new = imputed_city_df.loc[index, col]
initial_df.loc[initial_df['ID']==id_new, col] = temp_new
num_imputation_df = initial_df.set_index(['ID'])
num_imputation_df
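The loop above writes imputed values back one row at a time; an equivalent and typically much faster pattern groups by city and interpolates each group in one pass. Below is a sketch on toy data using the same column names as above (the real code would pass `accidents_2016_df` and the five weather columns instead):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the weather columns of accidents_2016_df
toy = pd.DataFrame({
    'City': ['A', 'A', 'A', 'B', 'B'],
    'Start_Time': pd.to_datetime(['2016-02-01', '2016-02-02', '2016-02-03',
                                  '2016-02-01', '2016-02-03']),
    'Temperature_F': [30.0, np.nan, 34.0, np.nan, 50.0],
})

def interpolate_city(group):
    # Time-based interpolation within a single city, filling in both directions
    g = group.set_index('Start_Time')
    num_cols = ['Temperature_F']  # in the real data: the five weather columns
    g[num_cols] = g[num_cols].interpolate(method='time', limit_direction='both')
    return g.reset_index()

filled = toy.groupby('City', group_keys=False).apply(interpolate_city)
print(filled['Temperature_F'].tolist())  # [30.0, 32.0, 34.0, 50.0, 50.0]
```

Each city is interpolated independently, so a city with a single observed value (city B above) simply back/forward fills that value, matching the `limit_direction='both'` behaviour used in the loop.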
Combine the imputed numerical columns with the rest of the columns.
num_imputation_df2= num_imputation_df.drop(['Start_Time','City'], axis = 1)
imputed_cols = num_imputation_df2.columns
accidents_clean_temp_df2= accidents_clean_temp_df.set_index('ID')
accidents_clean_temp_df2 = accidents_clean_temp_df2.drop(imputed_cols, axis = 1)
accidents_clean_temp_df3 = pd.concat([num_imputation_df2, accidents_clean_temp_df2], join = 'inner', axis = 1)
Check whether there are any more missing values left by plotting the percentage of them:
missings_df2= calculate_missing_values(accidents_clean_temp_df3)
#Plot percentage of missing values for each column
missings_perc_ser = pd.Series(missings_df2['missings_perc'], name ='missings_perc')
missings_df2['missings_perc'][missings_perc_ser != 0].plot(kind= 'barh',
xticks = [int(x) for x in np.linspace(0,100,11)],
title = 'Percentage of missing values after imputation',
figsize = (9,6));
We can see that even though the missing percentages have decreased, there are still some missing values in the dataset.
It is observed that 7 weather-related variables have a similar missing percentage of about 1%. This raises the question of whether these missing values occur in the same records. If so, we will drop those records, since most of the weather-related variables are missing from them. To find out, in the next step we visualize the distribution of the missing records for these 7 variables.
#Get the names of the columns whose missing-value percentage is between 0.7% and 5%:
missing_cols = list(missings_df2[(missings_perc_ser > 0.7) & (missings_perc_ser < 5)].index)
#Keep only the 7 variables that have around 1% of missing values and only the rows with at least 1 missing value.
#This will facilitate the visualisation and help us focus on the records of interest.
missing_data = accidents_clean_temp_df3[missing_cols]
missing_data = missing_data[missing_data.isnull().any(axis=1)]
#Visualize the missing values from each record for the 7 variables that have around 1% of missing values.
#Before doing that, to facilitate the visualization, we sort the dataframe, so that the NaNs appear all together
missing_data = missing_data.sort_values(by=missing_cols, ascending =False)
msno.matrix(missing_data , labels = True);
Observations:
From the visualization above it can be seen that 151 records remain with missing values in the 7 variables. A quarter of those missing values appear in the same records, except for Wind_Speed_mph, which has more missing values; this can be seen from the bottom part of the visualisation. Because of that, our approach will be to drop the records where all 7 of these variables are missing.
The remaining records with missing values (the ones at the top of the visualization) are very few and were not imputed by the interpolation method above, because the city they belong to had no other weather values on previous or later days. For these reasons, we decide to also drop these remaining records.
Drop remaining rows with missing values and print the new number of records in the dataset.
accidents_clean_temp_df4= accidents_clean_temp_df3.dropna()
print('The remaining number of records in the dataset is ' + str(accidents_clean_temp_df4.shape[0]))
Transform Fahrenheit to Celsius for the Temperature variable in the dataset, so that it is more meaningful to us. Therefore, instead of the Temperature_F variable, we will have the Temperature_C variable.
For the transformation, the following equation is used.
\begin{eqnarray*} Celsius&=&\frac{5}{9}(Fahrenheit-32) \end{eqnarray*}
accidents_clean_temp_df4['Temperature_C'] = ((accidents_clean_temp_df4['Temperature_F'] - 32) * (5/9)).round(decimals=1)
accidents_clean_temp_df5 = accidents_clean_temp_df4.drop('Temperature_F', axis =1)
Create dummy variables for the categorical variables that have a cardinality of less than 4, as decided in the 'Exploration' part of the project.
For example, for the variable Side, the dummy variable Side_R is created, which is boolean and takes the value 1 when the accident happened on the right side of the road.
For the variable Sunrise_Sunset, the new dummy variable Sunrise_Sunset_Day is created; the value 1 represents accidents that happened during the day and 0 those during the night.
The rest of the variables are treated in a similar way.
dummies_df = pd.get_dummies(accidents_clean_temp_df5, columns=['Side', 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight' , 'Timezone'], prefix_sep='_')
#For the categorical variables that had only 2 levels, drop one of the 2 dummy variables, in order to use it as a reference group.
#Otherwise the 2 dummy variables would be perfectly correlated, adding unnecessary complexity.
dummies_df = dummies_df.drop(['Side_L','Sunrise_Sunset_Night', 'Civil_Twilight_Night', 'Nautical_Twilight_Night', 'Astronomical_Twilight_Night'], axis =1)
#Rename some of the created dummy variables, so that they do not contain symbols that would need special treatment.
accidents_clean_df = dummies_df.rename({'Timezone_US/Central': 'Timezone_Central',
'Timezone_US/Eastern': 'Timezone_Eastern',
'Timezone_US/Pacific': 'Timezone_Pacific',
},
axis='columns')
For this section, I used the BeautifulSoup package to scrape population data from Wikipedia via web scraping. This is the population data of different US states, which is useful for comparing the number of accidents in a state to the state's population. I decided to use census counts instead of estimates, as I thought the data would be more reliable.
url = "https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population#cite_note-USCBGazetteer-4"
table_class= "wikitable sortable jquery-tablesorter"
response = requests.get(url)
print(response.status_code)
soup = BeautifulSoup(response.text, 'html.parser')
pop_table = soup.find('table', {'class':"wikitable sortable"})
df = pd.read_html(str(pop_table))
df = pd.DataFrame(df[0])
#print(df.head())
#display(df.head())
wiki_data = df.rename(columns= {"2020rank":"2020_Rank", "State[c]":"State", "2020census":"2020_Population", "2010census":"2010_Population"})
wiki_data.head()
The Wikipedia data was scraped successfully, as evidenced by the 200 status code.
I load the data that was cleansed earlier, in order to perform further visualisations.
First, this section gives an overview of the US traffic accidents based on the identified weather conditions and groups them accordingly. Secondly, certain weather-related variables are selected in order to see whether there is a correlation between them: severity, visibility (miles), temperature (Celsius) and wind speed (miles per hour).
The correlation pairplot reveals a lack of strong correlation between the variables. Separately, each variable reveals a pattern. For severity, there are more accidents with a low severity measure than a higher one (more 2's than 3's or 4's). For visibility, more accidents occur under far visibility than close visibility. For temperature (C), there is broader variation and no clear association between the number of accidents and a specific temperature. For wind speed, more accidents occur at low or mid wind speeds than at high ones.
df = accidents_clean_df.copy()
df.reset_index( inplace = True)
sns.reset_orig() # don't change the style of plots
sev_df = df.groupby('Severity')['Severity'].agg('count')
event_df = df.groupby('Weather_Condition')['Weather_Condition'].agg('count')
fig, axes = plt.subplots(ncols=2, nrows=1, figsize=(15, 4))
sev_df.sort_values(ascending=False).head(20).plot(kind='bar', color='b', alpha=0.4, ax=axes[0])
axes[0].set_ylabel('Number of events')
event_df.sort_values().plot(kind='barh', color='b', alpha=0.4, ax=axes[1])
axes[1].set_xlabel('Number of events')
fig.subplots_adjust(wspace=0.8);
plt.show()
This plot reveals the number of severe events in February and March 2016 (based on our clean dataset), showing that most accidents were of severity '2'. There were similar numbers of accidents with severity '3' or '4'. Similarly, the second plot shows the number of events under each weather condition. Surprisingly, most accidents occurred under clear weather conditions; overcast and mostly cloudy came second and third, as hypothesised.
scatter = df[['Severity', 'Visibility_mi', 'Temperature_C', 'Wind_Speed_mph']]
sns.pairplot(scatter, kind='hist')
plt.show()
This pairplot shows the correlation between four variables: wind speed, temperature (C), visibility and severity. There is a lack of strong correlation between any of the variables.
From the interactive map, it is revealed that most accidents occurred on the west and east coasts. In particular, severe accidents tended to occur on the east coast, in states such as New York. However, there were also accidents that occurred more centrally.
os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))
m = folium.Map(location=[37.0902, -95.7129], zoom_start =4)
# Display the map
m
In this first section, I load an interactive map that is zoomable and shows the whole of the USA. This lets users explore the map before plotting the accidents data.
fig = px.scatter_geo(df,lat='Start_Lat',lon='Start_Lng', hover_name="ID",height=1000)
fig.show()
In this initial map, I plotted the accidents data by longitude and latitude and used the ID as their hover names. However, I improve upon this map below using a different method from a different library, as this one does not represent the data clearly enough.
fig1 = go.Figure(data=go.Scattergeo(
lon = df['Start_Lng'],
lat = df['Start_Lat'],
text = df['Description'],
mode = 'markers',
marker = dict(
size = 5,
opacity = 0.8,
reversescale = True,
autocolorscale = False,
symbol= 'circle',
line = dict(
width=1,
color= 'rgba(102,102,102)'),
colorscale= 'blackbody',
cmin =0,
color = df['Severity'],
cmax = df['Severity'].max(),
colorbar_title= "Severity of Accidents")
))
fig1.update_layout(
title = 'Most severe US Car Accidents<br>(Hover for description)',
geo_scope='usa',
height=1000
)
fig1.show()
In this interactive map, I improved the data visualisation by adding a legend and having a different colour for each incident severity. This revealed that the most serious incidents (red and black) occurred in the North East and Mid East regions, whereas the less severe accidents ('2', yellow) took place on the West and East Coasts.
df = df.set_index(pd.DatetimeIndex(df['Start_Time']))
monthly = df.resample('M').count()
yearly = df.resample('Y').count()
res_2016 = df[~(df['Start_Time'] > '2016-12-31')]
res_2017 = df[~(df['Start_Time'] < '2017-01-01')]
monthly_data = monthly[['ID']]
monthly_data.index.date
#monthly_data['Start_Time'] = pd.to_datetime(monthly_data['Start_Time']).dt.date
y = monthly_data['ID']
mylabels = monthly_data.index.date
plt.pie(y, labels = mylabels)
plt.show()
This piechart reveals the number of accidents by month, which shows that more accidents occurred in March 2016 than February 2016.
From the pie chart representations, it is clear that more accidents occurred in March 2016 than in February 2016. The most common weather condition was 'clear' with a count of 1062, the second most common was 'overcast' at 553, and the third was 'mostly cloudy' at 378.
We use an interactive method to re-visualise the data.
plt.style.use('seaborn')
weather_condition = res_2016.groupby(["Weather_Condition"])["ID"].count().reset_index(name='Count')
fig = px.pie(weather_condition, values='Count', names='Weather_Condition',
title='Accidents by Weather Condition (Feburary and March 2016)',
hover_data=['Count'], labels={'Count':'Count'}, width= 800, height= 800)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()
This plot depicts the data much more clearly and allows for interactivity, whereby the user can hover to see the count of accidents in each category. It also has a useful legend, and conditions can be removed from the chart by clicking on them in the legend.
weather_condition.loc[weather_condition['Count'] < 500, 'Weather_Condition'] = 'Other conditions'
fig = px.pie(weather_condition, values='Count', names='Weather_Condition', title='Accidents by Weather Condition (February and March 2016)', width= 800, height = 800)
fig.show()
To split up the accidents a bit further, I chose to visualise the three most common groups: Clear, Overcast and Other conditions, in order to see which was most common and by what percentage.
There were more accidents in the states of California, Ohio and New York than in any other state in February and March 2016. This may be because these states are tourist destinations and densely populated, so there is greater potential for accidents to occur.
state_count = res_2016.groupby(["State"])["ID"].count().reset_index(name='Count')
fig = px.pie(state_count, values='Count', names='State',
title='Accidents by State (February and March 2016)',
hover_data=['Count'], labels={'Count':'Count'}, width = 800, height= 800)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()
Similarly, I used this visualisation tool to see how many accidents there were in each state in February and March 2016. It shows the percentage, the count and the state name.
plt.hist(res_2016["Visibility_mi"], bins=50)
plt.title("Histogram for Visibility(mi) of Car Accidents, 2016")
plt.xlabel("Visibility")
plt.ylabel("Frequency")
plt.show()
This histogram shows the frequency of car accidents in 2016 by visibility.
sns.barplot(x="Severity",y="Temperature_C", data=res_2016)
plt.title("Barplot")
This barplot shows the average temperature (C) of accidents at each severity level.
sns.distplot(res_2016["Wind_Speed_mph"],bins=100,kde=True);
This distplot shows the distribution of accidents by wind speed.
wiki_data.head()
The linear regression revealed that the number of accidents increases with population. The R-squared value of 0.824 suggests the correlation is quite strong. This approach follows the Linear Regression Analysis book by Seber [7].
#print(statesdf)
#print(state_count)
merged_state=state_count.merge(df,on="State")
wholedataset_wiki= wiki_data.merge(merged_state, on="State")
wholedataset_wiki= wholedataset_wiki.rename(columns={"Count": "Accidents"})
wholedataset_wiki
wholedataset =wholedataset_wiki.groupby(["State"]).sum()
wholedataset['%'] = (wholedataset['Accidents']/wholedataset['2020_Population'])*100
wholedataset['State'] = wholedataset.index
wholedataset
wholedataset.plot(x='State', y=["Accidents", "2020_Population"], kind="bar")
ax = plt.subplot()
ax.bar(wholedataset["State"], wholedataset["2020_Population"])
ax.bar(wholedataset["State"], wholedataset["Accidents"], color="maroon")
degrees = 90
plt.xticks(rotation=degrees)
plt.ticklabel_format(style='plain', axis='y')
plt.show()
x = wholedataset["2020_Population"].astype(int)
x.describe()
# Create figure and plot space
plt.style.use('seaborn')
fig, ax = plt.subplots(figsize=(10, 10))
# Add x-axis and y-axis
ax.scatter(wholedataset['2020_Population'],
wholedataset['Accidents'],
color='purple')
# Set title and labels for axes
ax.set(xlabel="Population",
ylabel="Accidents",
title="Accidents by state - US\n 2016")
plt.tight_layout()
#plt.plot(monthly_data.index.values, monthly_data['ID'], 'xb-')
plt.show()
from scipy import stats
x= wholedataset["2020_Population"].values
y= wholedataset.Accidents.values
res = stats.linregress(x,y)
print(f"R-squared: {res.rvalue**2:.6f}")
plt.figure(figsize=(10,8))
plt.plot(x,y, "o", label="original data")
plt.plot(x, res.intercept + res.slope*x, "r", label= "fitted line")
plt.legend()
plt.xlabel("Population")
plt.ylabel("Accidents")
plt.xticks(rotation=degrees)
plt.ticklabel_format(style='plain')
plt.show()
res_2016.describe()
import matplotlib.pyplot as plt
from scipy import stats
y= res_2016["Temperature_C"].values
x= res_2016["Visibility_mi"].values
res = stats.linregress(x,y)
print(f"R-squared: {res.rvalue**2:.6f}")
plt.figure(figsize=(10,8))
plt.plot(x,y, "o", label="original data")
plt.plot(x, res.intercept + res.slope*x, "r", label= "fitted line")
plt.legend()
plt.xlabel("Visibility")
plt.ylabel("Temperature")
plt.xticks(rotation=degrees)
plt.ticklabel_format(style='plain')
plt.show()
Here, I clean and merge the two datasets on relevant information so that I can visualise whether there is a correlation between state population and number of accidents. I do this with a linear regression, using `scipy.stats.linregress`. The regression revealed that the number of accidents increases with population: the R-squared value of 0.824 indicates a fairly strong positive relationship, as hypothesised.
feat_keep = ['Severity', 'Temperature_C', 'Humidity_prc', 'Pressure_in', 'Visibility_mi',
'Wind_Direction', 'Wind_Speed_mph', 'Weather_Condition']
features= df[feat_keep]
corr = features.corr()
fig, ax = plt.subplots(figsize=(15,15))
sns.heatmap(np.absolute(corr), annot=True)
plt.title('Correlation Heatmap', fontsize=20)
plt.show()
Lastly, a heatmap reveals the (absolute) correlations between six different factors: Severity, Temperature, Humidity, Pressure, Visibility and Wind Speed. It shows a relatively high correlation between Temperature and Visibility, as well as between Pressure and Visibility. The remaining pairs show much weaker correlations.
We were interested in the distribution of accidents across the week, so we used seaborn to generate a countplot of the accidents. Below we see that the majority of accidents occur on Tuesday, Wednesday and Thursday, while the weekend has the fewest. This is to be expected, since most people commute for work during the week, especially midweek. Looking at the violin plot on the left below, we see this could be the case, as the kernel density appears higher during the peak hours (6-10am and 3-7pm) and lower during inter-peak hours (10am-3pm) and off-peak hours (7pm-6am).
Whilst on most days most accidents have a severity level of 2, during the weekend most accidents have a level of 4. This could be due to people driving while impaired by alcohol, which is much more likely to lead to a fatal accident. The violin plot on the right demonstrates that this could be the case, with the kernel density being higher in the evening, when people are more likely to have been drinking.
df = accidents_clean_df.copy()
df.reset_index( inplace = True)
df['Day'] = df['Start_Time'].dt.day_name()
df['Hour'] = df['Start_Time'].dt.hour
fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
sns.countplot(ax=axes[0],x='Day', data = df).set_title('Number of accidents per weekday');
sns.countplot(ax=axes[1],x='Day', data = df, hue = 'Severity').set_title('Number of accidents per weekday');
fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
sns.violinplot(ax=axes[0], data = df, x = 'Severity', y = 'Hour').set_title('Plot of hour by Severity');
fri_df = df[df['Day']=='Friday']
sns.violinplot(ax=axes[1], data = fri_df, y = 'Hour').set_title('Plot of hour on Friday');
To get an idea of the distribution of accidents across the different road locations, a count plot was generated, as shown below. Here we can see that junctions have the highest number of accidents, with a value of around 250, whilst roundabouts and turning loops have none. This suggests that more work needs to be done to make junctions safer. For example, transportation authorities could look at updating the road markings. Traffic signals have the second highest count. Again, this is something that may need to be investigated: for example, is there a long enough gap between the green times of different traffic signals?
This is the total number of accidents for the US. It would probably be more useful to look at specific states or cities, as different states/cities have different characteristics.
# Plot the number of accidents per location
fig, ax = plt.subplots(figsize = (15,6))
df[['Amenity', 'Bump', 'Crossing',
'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']].sum().plot(kind='barh',
title = 'Number of accidents per location')
ax.set_xlabel('Number of accidents');
Knowing the specific states and cities that have the highest number of accidents could help us narrow down the main causes of the accidents. Below, count plots of accidents per state and city have been generated. Here, we see that California has the highest number of accidents, followed by Ohio. According to The Barnes Firm, states with a higher population have more cars on the road and are therefore more likely to have crashes [4].
# View count of accidents per city and state
fig, ax = plt.subplots(figsize = (15,6))
city_by_accident=df.City.value_counts()
city_by_accident[:30].plot(kind='barh', title = "30 highest counts of accidents per City");
fig, ax = plt.subplots(figsize = (15,6))
state_by_accident=df.State.value_counts()
state_by_accident.plot(kind='barh', title = "Counts of accidents per State");
Since California has the highest number of accidents, we decided to view its distribution. As expected, it has a similar distribution to the overall one; this could be because California makes up a large proportion of the data. Columbus city has a similar distribution. This shows that, generally, junctions tend to be the location where most accidents occur.
To get a better view of the number of accidents per state, a heatmap was created and is shown below. Here, we can see more easily that other states share this behaviour, with the highest number of accidents occurring at junctions.
fig, axes = plt.subplots(1,2, figsize = (15,6))
Cali_df = df[df['State'] == 'California']
Cali_df[['Amenity', 'Bump', 'Crossing',
'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']].sum().plot(ax = axes[0],kind='barh',
title = 'Number of accidents per location (California State)');
axes[0].set_xlabel('Number of accidents')
Cali_df = df[df['City'] == 'Columbus']
Cali_df[['Amenity', 'Bump', 'Crossing',
'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']].sum().plot(ax = axes[1], kind='barh',
title = 'Number of accidents per location (Columbus City)');
axes[1].set_xlabel('Number of accidents');
state_loc_df = df[['State', 'Amenity', 'Bump', 'Crossing',
'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']]
state_loc_piv = state_loc_df.pivot_table(index = 'State', values = ['Amenity', 'Bump', 'Crossing',
'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop'], aggfunc='sum')
fig, axes = plt.subplots(figsize = (15,10))
sns.heatmap(state_loc_piv, cmap = plt.cm.RdPu);
We are interested in seeing whether vehicle ownership per state is related to the number of accidents. As mentioned above, The Barnes Firm states that states with a higher population have more vehicles on the road, so we expect to see a correlation, with California having the largest number of vehicles.
The latest dataset that was found for vehicles per capita was for the year 2017, and was webscraped. Given that the data is similar to 2015, we assume that 2019 will be roughly the same as 2017. Also, note that the data records only light vehicles.
Interestingly, we find that California does not come up on top. In fact, Wyoming is seen to have the highest number of vehicles per capita. This could be because larger states are more likely to have public transport systems, lessening the need for people to have their own vehicles. Wyoming on the other hand, has one of the smallest populations, and therefore there may not be accessible public transport, giving the need for more private vehicle ownership.
Note that Wyoming is not in our main dataset, indicating just how few accidents it has, if any.
# Webscrape vehicle ownership per State in 2017
url = "https://en.wikipedia.org/wiki/List_of_U.S._states_by_vehicles_per_capita"
response = requests.get(url)
print(response.status_code)
soup = BeautifulSoup(response.content, "html.parser")
car_own = soup.find_all('table', {'class':"wikitable sortable mw-collapsible"})
df2 = pd.read_html(str(car_own[1]))
df2 = pd.DataFrame(df2[0])
Next, the vehicle counts were merged with the main dataframe and compared to the accident counts. Glancing at the table below, we cannot really see a correlation between the two variables. For example, Iowa has the highest vehicle ownership, at 1050 vehicles per 1000 people, yet a relatively low number of accidents. On the other hand, California has an accident count of 960, even though its vehicle ownership is only 840 per 1000 people.
The plot below visualises the proportions of vehicle ownership and accident counts, making the distributions easier to compare.
A scatter plot was also created to check for a correlation between the two variables, and again there is no apparent linear relationship between vehicle ownership and the number of accidents per state. This is confirmed by the correlation calculated below, with a value of -0.07, almost 0. Therefore, we can conclude that there is no linear correlation.
df3 = df.merge(df2, how = 'left', left_on='State', right_on= 'State')
veh_piv = df3.pivot_table(index = ['State', 'Vehicles per 1000 People'], values = 'ID', aggfunc = 'count')
veh_piv.rename(columns={'ID':'Accident count'}, inplace = True)
veh_piv.reset_index(inplace = True)
veh_piv.sort_values(by = 'Vehicles per 1000 People',ascending = False).head()
fig, ax = plt.subplots(figsize = (30,8))
plt.bar(veh_piv['State'], veh_piv['Vehicles per 1000 People'])
plt.bar(veh_piv['State'], veh_piv['Accident count'], bottom=veh_piv['Vehicles per 1000 People'])
plt.show()
veh_piv.reset_index(inplace =True)
fig, ax = plt.subplots(figsize= (30,20))
plt.scatter(veh_piv['Accident count'], veh_piv['Vehicles per 1000 People'])
states = veh_piv['State']
texts = []
for i, name in enumerate(states):
texts.append(ax.text(veh_piv['Accident count'][i],veh_piv['Vehicles per 1000 People'][i], name))
adjust_text(texts)
plt.rcParams.update({'font.size': 30})
ax.set_title("Vehicle ownership vs accident count",fontsize=30)
ax.set_xlabel("Vehicles per 1000 People",fontsize=20)
ax.set_ylabel("Accident count",fontsize=20)
plt.show()
states_count = veh_piv[['State', 'Accident count']]
states_count.set_index('State', inplace = True)
veh_piv.set_index('State', inplace = True)
veh_piv.corr()
Scrape traffic data from Google Maps. The data concern the time needed for a car to go from the start point of the accident to the end point, at the time of the accident and 30 mins before.
The code for extracting this data is commented out, because each user is allowed a limited number of free queries, hence this code shouldn't be run twice with the same API key. The data was saved in a csv file, which is read below.
(Note: This traffic data was retrieved for about 1000 accidents, and not for all the accidents analysed in this project, as the Google Maps API calls required credits after a specific number of requests.)
Information about the API used to retrieve Google Maps Directions can be found here: https://app.outscraper.com/api-docs
# #Store the API key into a python object
# api_client = ApiClient(api_key='YXV0aDB8NjFiYjkyNGNiNmZkMzAwMDZiMDZiMDNlfDMzZDUyZTQ5ZDk')
# #Get a subset of rows of the main dataset and select only the columns needed for scraping google maps data
# accidents_sample_df = accidents_df.iloc[0:1200 , 2:8]
# accidents_sample_df
# #Scrape traffic data from google maps, regarding the time needed to go from the accident start point
# #to the end point, before and during the accident
# #Loop through each row of the dataframe
# for index, row in accidents_sample_df.iterrows():
# #store the start and end coordinates in a format that google maps accepts
# coords = np.array([ str(row['Start_Lat']) + ', ' + str(row['Start_Lng']) + ' ' + str(row['End_Lat']) + ',' + str(row['End_Lng'])])
# #---------At the accident time:
# #Transform the value of the Start_Time column into datetime and then into timestamp
# dttm = datetime.strptime(row['Start_Time'], '%Y-%m-%d %H:%M:%S')
# timestamp = int(datetime.timestamp(dttm))
# try:
# google_maps = api_client.google_maps_directions(coords, departure_time = timestamp)
# gm_dict =google_maps[0] #get the dictionary from the list of 1 element
# accidents_sample_df.loc[index, 'duration']= gm_dict['duration(minutes)']
# accidents_sample_df.loc[index,'duration_max']= gm_dict['duration_max(minutes)']
# accidents_sample_df.loc[index,'speed']= str(gm_dict['road_distance_timing(meters:seconds)'])
# print(index)
# except: # takes care of the case when the API call will not be successful
# print (coords, dttm, ' did not work')
# #----------30 mins before the accident time:
# timestamp_30bef = timestamp - 1800
# try:
# google_maps = api_client.google_maps_directions(coords, departure_time = timestamp_30bef)
# gm_dict =google_maps[0] #get the dictionary from the list of 1 element
# accidents_sample_df.loc[index, 'duration_30bef']= gm_dict['duration(minutes)']
# accidents_sample_df.loc[index,'duration_max_30bef']= gm_dict['duration_max(minutes)']
# accidents_sample_df.loc[index,'speed_30bef']= str(gm_dict['road_distance_timing(meters:seconds)'])
# print(index)
# except:
# print (coords, dttm, ' did not work')
# #save dataframe to csv file
# accidents_sample_df.to_csv('gm_all.csv', index = True)
# gm_all_df
#Load google maps data that were generated before:
gm_all_df = pd.read_csv ('gm_all.csv')
gm_all_df.head()
The columns speed and speed_30bef were extracted from Google Maps and are in the format of a string dictionary. Below we convert them into meaningful measurements of speed. In particular, Google Maps returned, in dictionary format, the distance and the time needed to cover that distance at different points between the start and end locations of the accident. Using this information, we calculate the maximum speed at which a car could move within the accident area and create the columns speed_max and speed_30bef_max. The first refers to the maximum speed at the time of the accident, and the second to the maximum speed 30 mins before the accident. The difference of these two variables is also calculated and stored in a new column called speed_max_diff.
The speed difference is calculated so that a positive value represents an increase in speed at the time of the accident.
#Transform the speed and speed_30bef columns, which are in the format of a string dictionary, into meaningful measurements of speed
#Iterate through the rows of the dataframe
for index, row in gm_all_df.iterrows():
#If the API call to google maps was not successful and the result is Nan then skip this iteration
#and continue to the next
if pd.isnull(gm_all_df.loc[index, 'speed']):
continue
if pd.isnull(gm_all_df.loc[index, 'speed_30bef']):
continue
#Create a function that accepts a column name, which has the speed in dictionary format saved,
#extracts its values and creates an array with the speed in different distance points:
def get_speed (column_name):
#Extract the value from the speed column of the particular row and save it in an object:
speed1 = gm_all_df.loc[index, column_name]
#The next function needs double quotes around keys and values, so we replace single quotes with double
speed1_string = speed1.replace("'", "\"")
#Transform string into dictionary
speed1_dict = json.loads(speed1_string)
#Remove the items from the dictionary with zero time. This will facilitate the numeric calculations (divide) later on
for k, v in list(speed1_dict.items()):
if v == 0:
del speed1_dict[k]
#Get the keys of the dictionary into a list. The keys represent the meters. Their format is string.
speed1_meters_str= list(speed1_dict.keys())
#Perform conversion of the string elements in the list into integers (meters) using list comprehension
speed1_meters_num = [int(i) for i in speed1_meters_str]
#Get the values of the dictionary into a list.
#The values represent the seconds needed for a car to cover the distance mentioned in the keys of the dictionary
speed1_seconds_num = list(speed1_dict.values())
#Transform the lists into numpy arrays in order to perform division of their elements and store them into a new numpy array.
#Dividing meters/seconds returns the speed.
a = np.array(speed1_meters_num)
b = np.array(speed1_seconds_num)
speed_array = np.divide(a,b)
# Return the array with the speed in different distance points for the particular accident at the particular time
return speed_array
#Get speed array for the column 'speed':
speed_at_accident = get_speed(column_name = 'speed')
if len(speed_at_accident) == 0 :
continue
#Calculate maximum speed, round it to 2 decimal places and save it in the dataframe as new column
gm_all_df.loc[index, 'speed_max'] = np.around(np.max(speed_at_accident), 2)
#Get speed array for the column 'speed_30bef':
speed_30bef_accident = get_speed(column_name = 'speed_30bef')
if len(speed_30bef_accident) == 0 :
continue
#Calculate maximum speed, round it to 2 decimal places and save it in the dataframe as new column
gm_all_df.loc[index, 'speed_30bef_max'] = np.around(np.max(speed_30bef_accident), 2)
#Convert duration from minutes to seconds
gm_all_df.loc[index, 'duration'] = gm_all_df.loc[index, 'duration'] * 60
gm_all_df.loc[index, 'duration_max'] = gm_all_df.loc[index, 'duration_max'] * 60
gm_all_df.loc[index, 'duration_30bef'] = gm_all_df.loc[index, 'duration_30bef'] * 60
gm_all_df.loc[index, 'duration_max_30bef'] = gm_all_df.loc[index, 'duration_max_30bef'] * 60
gm_speed_df = gm_all_df.drop(['speed', 'speed_30bef', 'Start_Time', 'End_Time' , 'Start_Lat' , 'Start_Lng' , 'End_Lat' , 'End_Lng'], axis = 1)
gm_speed_df= gm_speed_df.set_index('ID')
gm_speed_df.head()
#Combine google maps and accidents datasets
accidents_gm_speed_df = pd.concat([accidents_clean_df, gm_speed_df,], join = 'inner', axis = 1)
#drop rows with missing values
accidents_gm_speed_df = accidents_gm_speed_df.dropna()
#Get the name of the day and the hour
accidents_gm_speed_df['Day'] = accidents_gm_speed_df['Start_Time'].dt.day_name()
accidents_gm_speed_df['Hour'] = accidents_gm_speed_df['Start_Time'].dt.hour
#List of different binary columns we want to use to split the data
bin_cols =[ 'Crossing' ,'Junction', 'Amenity', 'Stop', 'Traffic_Signal']
accidents_gm_speed_df['Weekend'] = accidents_gm_speed_df['Day'].apply(lambda x: 'Yes' if x in ['Saturday', 'Sunday'] else 'No')
for c in bin_cols:
accidents_gm_speed_df[c] = accidents_gm_speed_df[c].apply(lambda x: 'Yes' if x ==1 else 'No')
#Calculate the difference between the maximum speed 30 mins before the accident and at the time of the accident.
accidents_gm_speed_df['speed_max_diff']= accidents_gm_speed_df['speed_max'].sub(accidents_gm_speed_df['speed_30bef_max'],
axis=0)
#If speed_max_diff > 0 then the speed at the time of the accident is higher than before
#If speed_max_diff < 0 then the speed is lower at the time of the accident
The traffic data extracted from Google Maps is explored to answer the following question:
Did road conditions, such as the existence of a crossing, or the day of the week (weekend or not), positively or negatively affect the difference in the speed at which a car could move after the accident?
To answer this question, we first plot the distribution of the maximum speed per severity level and the distribution of the speed difference per severity level. Afterwards, we plot violin plots of the distribution of the speed difference by severity level, splitting the data on a different column each time, in order to investigate the impact of road conditions and the day of the week. The attributes we investigate relative to the speed difference are: Crossing, Junction, Amenity, Stop, Traffic_Signal and Weekend.
#Plot the distribution of the speed per severity and the speed difference per severity
fig, ax = plt.subplots(1, 2, figsize=(14,6))
accidents_gm_speed_df[accidents_gm_speed_df['Severity'] ==2]['speed_max'].plot(legend=True, kind = 'density', ax = ax[0])
accidents_gm_speed_df[accidents_gm_speed_df['Severity'] ==3][ 'speed_max'].plot(legend=True, kind = 'density', ax = ax[0])
accidents_gm_speed_df[accidents_gm_speed_df['Severity'] ==4][ 'speed_max'].plot(legend=True, kind = 'density', ax = ax[0])
ax[0].set_title('Speed distribution', fontsize=15, y =1.1)
ax[0].set_ylabel('Density', fontsize=15)
ax[0].set_xlabel('Speed (meters/second)', fontsize=15)
ax[0].xaxis.set_label_coords(1.1, -0.1)
ax[0].yaxis.set_label_coords(-0.2,0.5)
ax[0].legend(labels =['Severity 2','Severity 3','Severity 4'], loc='upper left', fontsize=11)
accidents_gm_speed_df[accidents_gm_speed_df['Severity'] ==2]['speed_max_diff'].plot(legend=True, kind = 'density', ax = ax[1])
accidents_gm_speed_df[accidents_gm_speed_df['Severity'] ==3]['speed_max_diff'].plot(legend=True, kind = 'density', ax = ax[1])
accidents_gm_speed_df[accidents_gm_speed_df['Severity'] ==4]['speed_max_diff'].plot(legend=True, kind = 'density', ax = ax[1])
ax[1].set_title('Speed Difference distribution', fontsize=15, y =1.1)
ax[1].set(ylabel=None)
ax[1].legend(labels =['Severity 2','Severity 3','Severity 4'], loc='upper left', fontsize=11)
plt.show()
Observations:
From the speed density plot above, we see that the maximum speed at which a car could cover the accident area at the time of the accident, based on Google Maps data, is roughly normally distributed, with very similar behaviour for accidents of severities 2 and 3, where the median speed is around 30 meters per second. Accidents with severity 4 differ, as their median speed is a bit lower, around 20 meters per second, the opposite of what we were expecting. We also observe that severity 4 accidents have a higher spread in their speed values. This might mean that low severity accidents usually happen on roads with a common speed limit.
From the speed difference density plot above, we see that all severity levels behave very similarly with respect to the speed difference before and during the accident: the median value is close to 0, meaning that usually there is no big speed difference. However, the tail of the severity 2 accidents is long but thin, implying the presence of some outliers. To facilitate the next visualisations, we remove the accidents with a speed difference more than 3 standard deviations away from the mean. In that way, we can focus our analysis on the majority of the accidents.
'''Standardize speed differences and remove the accidents with a speed difference
more than 3 standard deviations away from the mean.'''
data = accidents_gm_speed_df['speed_max_diff']
data_z = (data-data.mean())/(data.std())
gm_speed_df_3stds = accidents_gm_speed_df[(np.absolute(data_z) < 3) ]
bin_cols.append("Weekend")
sns.reset_orig()
f, axes = plt.subplots(2,3, figsize=(15,15), sharey=True, sharex=True)
for i, c in enumerate(bin_cols): #iterate through the columns based on which we want to split the data
#set the correct values for the axes indeces:
if i <3 :
row = 0
j = i
else :
j =i-3
row =1
sns.violinplot(x="Severity", y="speed_max_diff", data=gm_speed_df_3stds, hue = c, points=100,
palette="RdBu", split=True, ax=axes[row,j]).set_title('Accidents split by ' + c)
axes[row,j].legend(bbox_to_anchor=(0.9, 1), loc=2, borderaxespad=0., title=c, fontsize =12, title_fontsize=12).get_frame().set_alpha(None);
axes[row,j].set(xlabel=None)
axes[row,j].set(ylabel=None)
plt.suptitle('Maximum Speed Difference vs Severity', fontsize = 22);
f.subplots_adjust(wspace = 0.5); #adjust the plots so that the first column has some space from the second
axes[0,0].set_xlabel("Severity", size=20)
axes[0,0].set_ylabel("Difference in the maximum Speed", size=20)
axes[0,0].xaxis.set_label_coords(2, -1.3)
axes[0,0].yaxis.set_label_coords( -0.2, -0.1)
Observations:
From the violin plot split by Crossing, we understand that the vast majority of accidents that occurred at a crossing resulted in the highest severity level. We also see that the speed difference for these accidents follows a very different distribution with very heavy tails: when a severity 4 accident occurred at a crossing, the traffic was affected with high variability, meaning the speed could increase or decrease substantially.
A high variability in the speed difference (although not as high) is also observed for severity 4 accidents when a traffic signal was at the accident location. However, we see that the traffic signal influences the traffic positively: when it is present, the speed at which cars can move usually increases. This is derived from the right-skewed distribution where Severity = 4 and Traffic_Signal = Yes (blue). The intuition could be that traffic signals facilitate the traffic, so there is less need for human intervention to control it, compared to roads where there is no traffic signal.
High variability in the speed difference for severity 4 accidents is also seen at weekends. This could be explained by the fact that at weekends there are fewer police on duty and more people drive for leisure trips or to places they haven't been before, which can make the speed vary more.
It is also interesting to observe, from the violin plot split by Amenity, that almost none of the most severe accidents happened close to an amenity. Also, the severity 2 accidents that occurred next to an amenity clearly affected the traffic negatively, by reducing the speed at which a car could move. This can be intuitively explained by the fact that roads with amenities are usually small roads in residential areas, so an accident might block a big part of them.
When accidents in the US are recorded, a Description is also recorded. Obviously the Description is created after the accident, so it will not be used for the Severity prediction. However, we would like to explore the features of the Description in relation to the Severity, to understand whether a different format is followed at each severity level and whether information about the accident can be extracted that helps us understand what the different severity levels represent.
For a start, we print one random example of the Description column from each Severity level, in order to have an overall understanding of the text.
description_df = accidents_clean_df[['Description', 'Severity']]
#drop missing values if any on the description column
#description_df= description_df.dropna()
#Get the value of one Description from each severity level to display as an example
desc2_str = description_df.loc[description_df['Severity'] == 2,'Description'][100] #100 is just a random number
desc3_str = description_df.loc[description_df['Severity'] == 3,'Description'][100]
desc4_str= description_df.loc[description_df['Severity'] == 4,'Description'][100]
print("\n")
print(100 * '^')
print('Severity 2: '+ Fore.GREEN + desc2_str)
print(Fore.BLACK + 'Severity 3: '+ Fore.BLUE + desc3_str)
print(Fore.BLACK + 'Severity 4: '+ Fore.RED + desc4_str)
print(Fore.BLACK +100 * '^')
print("\n")
The first observation is that Severity 2 accidents are described by the location of the accident, Severity 3 by a wider area, and Severity 4 by the consequences of the accident on the traffic. However, this observation might be coincidental, so we perform a more detailed analysis below.
We calculate and plot the average number of words that are used for the description of an accident for each severity level.
#Get the number of words
description_df['desc_word_count'] = description_df['Description'].apply(lambda x: len(str(x).split()))
#Calculate and plot the average number of words that appear in the Desciption column, per Severity level
descr_agg_df = description_df.groupby('Severity')['desc_word_count'].agg('mean')
fig, ax = plt.subplots( figsize=(15, 4))
descr_agg_df.sort_values(ascending=True).head(3).plot(kind='bar', color=(['g','b','r']), alpha=0.6)
ax.set_ylabel('Number of words', fontsize =14)
ax.set_xlabel('Severity', fontsize =14)
ax.set_xticklabels(['2', '3', '4'], rotation=0, fontsize='medium');
plt.title('Average number of words in the Description of each accident, for each severity level',fontsize=16, y =1.1);
From the bar plot above, it is clear that accidents with severities 2 and 3 are described with a similar number of words (around 7), whereas accidents with severity 4 are described with roughly twice as many.
For the next step of the text analysis, we consider as our corpus the Description values for each severity level separately, split the text into monograms (words) and count the number of accident descriptions in which each word appeared (not the total number of times the word appeared). We will not remove stopwords such as "and", because, as seen from the Description examples above, the text has a semi-structured format where words like "and" or "at" can reveal the type of accident. For example, the word "and" shows that more than one road was involved in the accident.
We will then transform the word count into a word frequency by dividing the number of descriptions in which a word occurred by the total number of accidents (for a severity level). We will use the word frequencies to plot word clouds and a bar plot of the frequency of the top 16 words.
It was seen that the word accident appeared in every description, and it was hence eliminated from the analysis, as it wouldn't provide any insight.
'''Create a function that splits the text in monograms (words) and count the number of accident descriptions
that each word appeared (not the total number of times the word appeared) and returns a dictionary with the results.'''
def get_top_n_words(corpus, n=None):
vec = CountVectorizer(binary=True).fit(corpus)
bag_of_words = vec.transform(corpus)
sum_words = bag_of_words.sum(axis=0)
words_count = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_count =sorted(words_count, key = lambda x: x[1], reverse=True)[:n]
word_dic = {}
for word, count in words_count:
word_dic[word] = count
del word_dic["accident"] # Delete accident, which is present in all records
return word_dic
word_dic_sev2 = get_top_n_words(description_df.loc[description_df['Severity'] == 2,'Description'], 16)
word_dic_sev3 = get_top_n_words(description_df.loc[description_df['Severity'] == 3,'Description'], 16)
word_dic_sev4 = get_top_n_words(description_df.loc[description_df['Severity'] == 4,'Description'], 16)
# Generate word cloud images for different severities
fig = plt.figure(figsize = (16,20))
for c in (2,3,4): #for the different values of severity
wordcloud= WordCloud(background_color="AliceBlue",
contour_width=5, contour_color='steelblue'
).generate_from_frequencies(globals()[f'word_dic_sev{c}'])
plt.subplot(6,3, (c-1)).set_title("Severity " + str(c), fontsize= 15)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
fig.suptitle('Word Cloud images per Severity', fontsize= 22, y = 0.95)
plt.show()
#Print multiple empty lines to make multiple graph display clean
print(2 * "\n")
print(115 *'^')
print(2 * "\n")
#Plot a bar chart with the most common words for each severity level
def get_word_freq(sev):
#Transform dictionary into a dataframe
data_items = globals()[f'word_dic_sev{sev}'].items()
data_list = list(data_items)
word_df = pd.DataFrame(data_list)
word_df.columns =['word', 'count'] #name the columns of the dataframe
word_df = word_df.set_index(['word']) #set index to word
#Count the number of accidents for each severity level
n_accidents_sev = description_df.loc[description_df['Severity'] == sev].shape[0]
#calculate the frequency of each word based on the count and the number of accidents each severity level has,
#so that the number next to each word can be comparable across different severity levels
word_df['freq']= word_df['count'].apply(lambda x: np.around((x/ n_accidents_sev),decimals = 2))
word_df= word_df.drop(['count'], axis= 1)
return word_df
for sev in (2,3,4):
globals()[f'word_sev{sev}_df']= get_word_freq(sev)
word_freq_all_df = pd.concat([word_sev2_df, word_sev3_df, word_sev4_df], axis=1, join ='outer', ignore_index =True)
word_freq_all_df.columns =['Severity 2', 'Severity 3', 'Severity 4']
for sev in (2, 3, 4):
    word_freq_all_df['Severity ' + str(sev)] = word_freq_all_df['Severity ' + str(sev)].fillna(0)
word_freq_all_df = word_freq_all_df.sort_values(by=['Severity 2', 'Severity 3', 'Severity 4'], ascending=False)
#bar chart
ax = word_freq_all_df.plot.bar(edgecolor='none', alpha = 0.5, figsize=(14,4), fontsize=14, color =('g','b', 'orange'));
ax.legend(bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.,fontsize=12)
ax.set_xlabel('Words', fontsize =17);
ax.set_ylabel('Frequencies' , fontsize =17)
ax.set_title('Most common words on the Description', fontsize = 24, pad =35, y =1.05)
plt.suptitle('(sorted by severity 2,3,4)', y=1.01, fontsize=14);
Observations:
From the word clouds we see that the words "exit" and "at" have the highest frequency for severities 2 and 3, whereas for accidents with severity 4 the words "closed", "road", "to" and "due" appear more often. This is also confirmed by the bar plot, where severities 2 and 3 follow a similar pattern, in contrast to severity 4.
From these visualisations, we can conclude that accidents are labelled with a severity level of 4 when a road needs to close because of the accident. We also see that accidents with a severity of 2 or 3 often happen at a highway exit.
The categorical variables City, State and Weather_Condition have a high granularity, with 813, 20 and 23 levels respectively, as seen in the initial Data Exploration part. Some predictive-modelling algorithms can only handle numeric variables, and transforming these categorical variables into (numeric) dummies adds a lot of complexity to the models: the City variable alone would produce 813 new dummy variables. To address this issue while still including the City, State and Weather_Condition information in the models, we convert these categorical variables into dummy variables and then reduce their high dimensionality to 1, 2 or 3 numeric components. We are aware that this process reduces the interpretability of the resulting models, but it should improve accuracy.
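As a compact illustration of this dummy-encoding and reduction idea (the toy city names below are illustrative only, not from the dataset), `pd.get_dummies` followed by `TruncatedSVD` compresses the indicator columns into a few numeric components:

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Toy categorical column with three distinct cities
toy = pd.DataFrame({'City': ['Dayton', 'Dublin', 'Dayton', 'Columbus', 'Dublin']})
dummies = pd.get_dummies(toy['City'])        # one indicator column per category
print(dummies.shape)                         # (5, 3)

# Reduce the 3 indicator columns to 2 numeric components
svd = TruncatedSVD(n_components=2, random_state=0)
emb = svd.fit_transform(dummies.to_numpy().astype(float))
print(emb.shape)                             # (5, 2)
```

With the real City variable, the same pipeline turns 813 dummy columns into 1, 2 or 3 components.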
Therefore, in this part of the project, we perform the following actions:
Reduce the dummy variables to 1, 2 and 3 numeric components representing the City and State variables, and to 1, 2 and 3 components representing the Weather_Condition variable.
#Create objects that will be needed later
Y_true = accidents_clean_df['Severity'].to_numpy() #save the severity labels in an array
models =("PCA", "MDS", "Isomap", "SpEmb", "LLE", "tSNE")
n_rows = len(Y_true) #number of rows
#list of colors we want to use in the scatter plots
cmap = LinearSegmentedColormap.from_list('severities', ['turquoise', 'darkviolet','red'])
sev = ['2','3','4'] #unique severity levels
#Create a function that will run the dimensionality reduction techniques for the given dataset and number of components.
def dimensionality_red_methods(X, comp):  # input: dataset to reduce and number of final components
    # PCA (via truncated SVD on the centred data)
    clf = decomposition.TruncatedSVD(n_components=comp)
    X_PCA = clf.fit_transform(X - X.mean(axis=0))  # fit the model after centring the data at 0
    # MDS
    clf = manifold.MDS(n_components=comp, n_init=1, max_iter=500)
    X_MDS = clf.fit_transform(X)
    # Isomap
    n_neighbors = 30
    clf = manifold.Isomap(n_neighbors=n_neighbors, n_components=comp)
    X_Isomap = clf.fit_transform(X)
    # Spectral Embedding
    clf = manifold.SpectralEmbedding(n_components=comp, random_state=0, eigen_solver="arpack",
                                     affinity='nearest_neighbors', n_neighbors=n_neighbors)
    X_SpEmb = clf.fit_transform(X)
    # LLE
    clf = manifold.LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=comp, method='standard')
    X_LLE = clf.fit_transform(X)
    # t-SNE
    clf = manifold.TSNE(n_components=comp, init='pca', random_state=0, n_iter=500)
    X_tSNE = clf.fit_transform(X)
    return X_PCA, X_MDS, X_Isomap, X_SpEmb, X_LLE, X_tSNE
Transform City and State variables into binary dummy variables.
location_dummies_df = pd.get_dummies(accidents_clean_df[['City', 'State']], columns=['City', 'State'], prefix_sep='_')
X_city_dummies = location_dummies_df
X_city_dummies = X_city_dummies.to_numpy()
Run the 6 dimensionality reduction methods for the combination of the City and State variables, for 1, 2 and 3 components.
drm_city_1 = dimensionality_red_methods(X_city_dummies, 1)
drm_city_2 = dimensionality_red_methods(X_city_dummies, 2)
drm_city_3 = dimensionality_red_methods(X_city_dummies, 3)
'''Create a matrix for each group of embeddings that is produced.
For example, the matrix X1_city_PCA will store the 1 component produced from the PCA method for the City/State variables.
The matrix X2_city_PCA will store the 2 components produced from the PCA method for the City/State variables.'''
topic = 'city'
for comp in range(1, 4):
    for i, m in enumerate(models):
        X = globals()[f'drm_{topic}_{comp}'][i]
        # normalize the data to be in the range [0,1]
        x_min, x_max = np.min(X, 0), np.max(X, 0)
        globals()[f'X{comp}_{topic}_{m}'] = (X - x_min) / (x_max - x_min)
Calculate the correlation of the new City/State related embeddings with Severity. We will try to use the correlations to decide on the more useful embeddings to be used in the predictive modelling for severity.
def correlation_of_embeddings_with_severity(topic):
    for comp in range(1, 4):
        df = pd.DataFrame()  # initialize the dataframe to store the results
        for i, m in enumerate(models):
            # Stack the `comp` embedding columns next to the severity labels
            # (e.g. globals()[f'X{comp}_{topic}_{m}'] resolves to X2_city_PCA)
            emb = globals()[f'X{comp}_{topic}_{m}'].reshape(n_rows, comp)
            X_Y = np.concatenate([emb, Y_true.reshape(n_rows, 1)], axis=1)
            X_Y_df = pd.DataFrame(X_Y)
            corr = np.around(X_Y_df.corr(method='pearson'), decimals=2)
            df[m] = corr.iloc[0:comp, comp]  # correlation of each component with severity
        display(df)
correlation_of_embeddings_with_severity('city')
Observations:
It can be seen that the 1-component embedding from Isomap has the highest correlation (0.24) with severity, so we will try that variable for the predictive model (X1_city_Isomap).
From the 2-component embeddings it is harder to decide based on correlation alone, as the city information is shared across two variables. Isomap has the highest correlation in the first component but a very low correlation in the second; LLE, on the other hand, has a very high correlation in both. To facilitate the decision regarding the 2-component embeddings, scatter plots are shown below.
From the 3-component embeddings LLE and t-SNE have interesting correlations.
Below we plot embeddings of 2 components for City/State:
def scatter_model(row, col, model, model_emb):
    sc = ax[row, col].scatter(model_emb[:, 0], model_emb[:, 1], s=17,
                              c=accidents_clean_df['Severity'].astype('category').cat.codes, cmap=cmap)
    ax[row, col].set_title(model, fontsize=24)
    return sc
fig, ax = plt.subplots(2, 3, figsize=(17,14), sharey=True, sharex=True)
scatter = scatter_model(0,0,'t-SNE', X2_city_tSNE)
scatter_model(0,1,'PCA', X2_city_PCA)
scatter_model(1,0,'Isomap', X2_city_Isomap)
scatter_model(1,1,'MDS', X2_city_MDS)
scatter_model(0,2,'LLE', X2_city_LLE)
scatter_model(1,2,'SpEmb', X2_city_SpEmb)
ax[0,0].set_xlabel("Embedding 1", size=20)
ax[0,0].set_ylabel("Embedding 2", size=20)
ax[0,0].xaxis.set_label_coords(1.7, -1.3)
ax[0,0].yaxis.set_label_coords( -0.1, -0.14)
# add legend to the plot with names
ax[0,0].legend(handles=scatter.legend_elements()[0],
labels=sev,
title="Severities", bbox_to_anchor=(-0.35, 1.3), loc=2, borderaxespad=0., fontsize =16, title_fontsize=16);
fig.suptitle('City/State Embeddings', fontsize = 24, y =1.0); #Main title
The t-SNE plot is very interesting, as clusters of the severity 4 accidents (red) can be identified. The correlation calculated before only captures the linear relationship of each component with severity, but this plot reveals a relationship that is not clearly linear. These two components derived from t-SNE will therefore be passed to the modelling part, to test whether they perform well in predicting severity.
The rest of the plots are not very helpful in identifying a relationship with severity, so the components derived from the other methods will not be used.
Transform Weather_Condition variable into binary dummy variables.
weather_dummies_df = pd.get_dummies(accidents_clean_df['Weather_Condition'], columns=['Weather_Condition'], prefix_sep='_')
X_weather_dummies = weather_dummies_df
X_weather_dummies = X_weather_dummies.to_numpy()
Run the 6 dimensionality reduction methods on the Weather_Condition variable, for 1, 2 and 3 components.
drm_weather_1 = dimensionality_red_methods(X_weather_dummies, 1)
drm_weather_2 = dimensionality_red_methods(X_weather_dummies, 2)
drm_weather_3 = dimensionality_red_methods(X_weather_dummies, 3)
topic = 'weather'
for comp in range(1, 4):
    for i, m in enumerate(models):
        X = globals()[f'drm_{topic}_{comp}'][i]
        # normalize the data to be in the range [0,1]
        x_min, x_max = np.min(X, 0), np.max(X, 0)
        globals()[f'X{comp}_{topic}_{m}'] = (X - x_min) / (x_max - x_min)
#The above stores the results of the function into variables named like X1_weather_PCA, X2_weather_PCA,
#depending on the number of components and the method followed
Calculate the correlation of the new Weather_Condition related embeddings with Severity.
correlation_of_embeddings_with_severity('weather')
Observations:
Embeddings that represent the weather-condition information generally have a lower correlation with severity than the city/state embeddings.
The 1-component embedding from Isomap has the highest correlation (-0.13) with severity, so we will try that variable for the predictive model (X1_weather_Isomap).
From the 2-component embeddings it is harder to decide based on correlation; Isomap and PCA have probably the best correlations overall. To facilitate the decision regarding the 2-component embeddings, scatter plots are shown below.
From the 3-component embeddings we see that the correlations become even lower, so we decide not to use any of these embeddings.
Below we plot the 2-component embeddings for Weather_Condition:
def scatter_model(row, col, model, model_emb):
    sc = ax[row, col].scatter(model_emb[:, 0], model_emb[:, 1], s=17,
                              c=accidents_clean_df['Severity'].astype('category').cat.codes, cmap=cmap)
    ax[row, col].set_title(model, fontsize=24)
    return sc
fig, ax = plt.subplots(2, 3, figsize=(17,14), sharey=True, sharex=True)
scatter = scatter_model(0,0,'t-SNE', X2_weather_tSNE)
scatter_model(0,1,'PCA', X2_weather_PCA)
scatter_model(1,0,'Isomap', X2_weather_Isomap)
scatter_model(1,1,'MDS', X2_weather_MDS)
scatter_model(0,2,'LLE', X2_weather_LLE)
scatter_model(1,2,'SpEmb', X2_weather_SpEmb)
ax[0,0].set_xlabel("Embedding 1", size=20)
ax[0,0].set_ylabel("Embedding 2", size=20)
ax[0,0].xaxis.set_label_coords(1.7, -1.3)
ax[0,0].yaxis.set_label_coords( -0.1, -0.14)
# add legend to the plot with names
ax[0,0].legend(handles=scatter.legend_elements()[0],
labels=sev,
title="Severities", bbox_to_anchor=(-0.35, 1.3), loc=2, borderaxespad=0., fontsize =16, title_fontsize=16);
fig.suptitle('Weather Embeddings', fontsize = 24, y =1.0);#Main title
No clear correlation with severity can be seen in any of the above scatter plots.
Based on our analysis before, we select a few of the created embeddings to feed the modelling part:
#Save the embeddings that we want to try in predictive modelling, into a dataframe
IDs = accidents_clean_df.index
embeddings_df = pd.DataFrame(index = IDs)
embeddings_df['city_emb1_Isomap_1'] = X1_city_Isomap

embeddings_df['city_emb2_LLE_1'] = X2_city_LLE[:, 0]
embeddings_df['city_emb2_LLE_2'] = X2_city_LLE[:, 1]

embeddings_df['city_emb2_tSNE_1'] = X2_city_tSNE[:, 0]
embeddings_df['city_emb2_tSNE_2'] = X2_city_tSNE[:, 1]

embeddings_df['city_emb3_LLE_1'] = X3_city_LLE[:, 0]
embeddings_df['city_emb3_LLE_2'] = X3_city_LLE[:, 1]
embeddings_df['city_emb3_LLE_3'] = X3_city_LLE[:, 2]

embeddings_df['weather_emb1_Isomap_1'] = X1_weather_Isomap

embeddings_df['weather_emb2_PCA_1'] = X2_weather_PCA[:, 0]
embeddings_df['weather_emb2_PCA_2'] = X2_weather_PCA[:, 1]
embeddings_df.head()
Having explored the data, we would like to see if we can predict the severity of the accidents given the features. We would also like to see which variables are most influential.
df['day (int)'] = df['Start_Time'].dt.weekday
df['day (int)'].nunique()
# Change Bool columns to 0 and 1
columns = ['Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop',
'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']
for c in columns:
    df[c] = df[c].astype(int)
df.reset_index(drop = True, inplace = True)
Since the data is highly skewed, we will use precision-recall curves to evaluate the models. Because the classes are imbalanced, accuracy would yield misleading results: a model can score highly simply by predicting the majority class, which merely reflects the label distribution. This is known as the accuracy paradox [5]. Therefore we will not use accuracy to compare the models.
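The accuracy paradox can be shown on a toy, synthetic label set (not the accident data itself): a "model" that always predicts the majority class reaches 90% accuracy while never detecting the minority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 90 + [1] * 10)   # 90% majority class
y_pred = np.zeros_like(y_true)           # always predict the majority class

acc = accuracy_score(y_true, y_pred)     # 0.9 -- looks impressive
rec = recall_score(y_true, y_pred, zero_division=0)  # 0.0 -- minority class never found
print(acc, rec)
```

This is why precision and recall, which look at each class separately, are the more honest metrics here.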
To model the data, variables that are not a float or integer will need to be removed, unless turned into dummy variables. In addition, booleans are transformed into integers, 0 and 1.
We first explore the multiclass case. Precision-recall curves are usually used in binary classification, so in order to use them for multi-class classification the output must be binarized. This is done using the scikit-learn function label_binarize().
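For instance, `label_binarize` turns each severity label into a one-hot indicator row:

```python
from sklearn.preprocessing import label_binarize

# One row per sample, one indicator column per class (2, 3, 4)
Y_bin = label_binarize([2, 3, 4, 2], classes=[2, 3, 4])
print(Y_bin)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]]
```

Each indicator column can then be scored with its own binary precision-recall curve.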
From the precision-recall plot produced below, we can immediately see that Severity 2 has the best score, showing that it would yield the most accurate results. This could be because it has far more data points than the other labels. Interestingly, Severity 4 has a much higher score than 3.
y = df['Severity']
n_class = [2,3,4]
Y = label_binarize(y, classes=[2,3,4])
X = df.drop(columns = ['Severity', 'ID', 'Day','Start_Time', 'End_Time','Start_Lat', 'Start_Lng', 'End_Lat',
'End_Lng', 'Distance_mi', 'Description', 'Street', 'City', 'County', 'State', 'Zipcode',
'Airport_Code', 'Weather_Timestamp', 'Wind_Direction', 'Weather_Condition',
'State abbreviation'])
X_train, X_test, y_train, y_test = train_test_split(X,Y,
random_state = 33)
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=50,
max_depth=3,
random_state=0))
clf.fit(X_train, y_train)
y_score = clf.predict_proba(X_test)
# precision recall curve
fig, ax = plt.subplots(figsize = (15,6))
precision = dict()
recall = dict()
for i, l in enumerate(n_class):
    precision[i], recall[i], _ = precision_recall_curve(y_test[:, i], y_score[:, i])
    avg_precision = average_precision_score(y_test[:, i], y_score[:, i])
    plt.plot(recall[i], precision[i], lw=2, label='class {} (area={})'.format(l, round(avg_precision, 2)))
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend(loc="best")
plt.title("Precision-recall curve")
plt.show()
Next, we evaluate a binary case. We combine Severity 3 and 4, and rename it as 1. This will indicate a very severe accident. Severity 2 will be renamed as 0, and this will indicate a light accident. Combining the labels has reduced the skewness, however there are still almost double the records in category 0 than category 1. Therefore, it makes sense to still use Precision-recall curves to evaluate the model.
df['Severity'].replace({4:1, 3:1, 2:0}, inplace = True)
df['Severity'].unique()
# Distribution of updated Severity
sns.countplot(data = df, x='Severity')
We define a function that plots the precision-recall curves for each model. The input is X, so that we can see how the results change with different features; this lets us compare how the different dimensionality reduction methods perform.
Overall, the model that performed best was Random Forest without any of the additional dimensionally-reduced variables. Conveniently, Random Forest also lets us view which variables contribute most to the model output.
Below we explore which variable is most important using a game theoretic approach, SHapley Additive exPlanations (SHAP).
def get_iterable(x):
    if isinstance(x, list):
        return x
    return [x]
def plot_pr(X, plot_size, title):
    '''
    Plot PR curves in a single plot.
    X are the features in the dataset;
    plot_size takes a positive value.
    '''
    # Define label
    y = ml_df['Severity']
    np.random.seed(55)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.80, random_state=33)
    classifier_names = ["Logistic Regression",
                        "Nearest Neighbors",
                        "Linear SVM",
                        "RBF SVM",
                        "Random Forest",
                        "XGBoost"]
    classifiers = [LogisticRegression(C=1., solver='lbfgs'),
                   KNeighborsClassifier(3),
                   SVC(kernel="linear", C=0.025, probability=True),
                   SVC(gamma=2, C=1, probability=True),
                   RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
                   XGBClassifier()]
    pred_prob = {}
    pred = {}
    # Fit the individual classifiers
    for name, classifier in zip(classifier_names, classifiers):
        np.random.seed(100)
        classifier.fit(X_train, y_train)
        pred_prob[name] = classifier.predict_proba(X_test)[:, 1]
        pred[name] = np.where(pred_prob[name] >= 0.5, 1, 0)
    plt.clf()
    fig = plt.figure(figsize=(plot_size, plot_size))
    ax = fig.add_subplot(1, 1, 1)
    for name in get_iterable(classifier_names):
        precision, recall, _ = precision_recall_curve(y_test, pred_prob[name])
        avg_precision = average_precision_score(y_test, pred_prob[name])
        ax.plot(recall, precision, lw=2.5, label=name + ' (area = %0.3f)' % avg_precision)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.01])
    plt.title(title)
    plt.legend(loc='center left', bbox_to_anchor=(1.05, 0.5), fontsize=9)
    plt.show()
SHAP plots rank the variables from most to least important. Each dot in the plot represents an observation in the dataset. The SHAP value is on the x-axis and indicates how much the feature changes the prediction in log-odds; the variable name is on the y-axis. The colour gradient represents the value of that variable (e.g. a high temperature is shown as red whilst a low temperature is blue). Binary variables are either red or blue, corresponding to values of 1 and 0 respectively.
From the plot, we can immediately see that Timezone Eastern is the top key variable. The Eastern Time Zone is shown in orange in the map below. Ohio lies within this time zone, and earlier we saw that it had the second-highest count of accidents. Furthermore, the plot shows that low humidity makes a severe accident more likely. Civil twilight is a way of indicating whether it is day or night, depending on when the geometric centre of the sun is 6 degrees below the horizon [6]. This variable has blue dots at positive SHAP values, indicating that daytime accidents are more likely to be very severe.
# Merge severity with embedding spreadsheet
ml_df = df.merge(embeddings_df, on = 'ID')
# Using original features
X1 = df.drop(columns = ['Severity', 'ID', 'Day','Start_Time', 'End_Time','Start_Lat', 'Start_Lng', 'End_Lat',
'End_Lng', 'Distance_mi', 'Description', 'Street', 'City', 'County', 'State', 'Zipcode',
'Airport_Code', 'Weather_Timestamp', 'Wind_Direction', 'Weather_Condition',
'State abbreviation'])
# Test out dimension reduction methods
# Use [['city_emb2_tSNE_1','city_emb2_tSNE_2']] in addition to other variables
X2 = ml_df.drop(columns = ['Severity', 'ID', 'Day','Start_Time', 'End_Time','Start_Lat', 'Start_Lng', 'End_Lat',
'End_Lng', 'Distance_mi', 'Description', 'Street', 'City', 'County', 'State', 'Zipcode',
'Airport_Code', 'Weather_Timestamp', 'Wind_Direction', 'Weather_Condition',
'State abbreviation', 'city_emb1_Isomap_1', 'city_emb2_LLE_1', 'city_emb2_LLE_2',
'city_emb3_LLE_1','city_emb3_LLE_2', 'city_emb3_LLE_3', 'weather_emb1_Isomap_1','weather_emb2_PCA_1',
'weather_emb2_PCA_2'])
# Use 'city_emb1_Isomap_1' and 'weather_emb1_Isomap_1'
X3 = ml_df.drop(columns = ['Severity', 'ID', 'Day','Start_Time', 'End_Time','Start_Lat', 'Start_Lng', 'End_Lat',
'End_Lng', 'Distance_mi', 'Description', 'Street', 'City', 'County', 'State', 'Zipcode',
'Airport_Code', 'Weather_Timestamp', 'Wind_Direction', 'Weather_Condition',
'State abbreviation', 'city_emb2_LLE_1', 'city_emb2_LLE_2',
'city_emb3_LLE_1','city_emb3_LLE_2', 'city_emb3_LLE_3', 'weather_emb2_PCA_1',
'weather_emb2_PCA_2', 'city_emb2_tSNE_1','city_emb2_tSNE_2'])
X4 = ml_df.drop(columns = ['Severity', 'ID', 'Day','Start_Time', 'End_Time','Start_Lat', 'Start_Lng', 'End_Lat',
'End_Lng', 'Distance_mi', 'Description', 'Street', 'City', 'County', 'State', 'Zipcode',
'Airport_Code', 'Weather_Timestamp', 'Wind_Direction', 'Weather_Condition',
'State abbreviation', 'city_emb2_LLE_1', 'city_emb2_LLE_2',
'city_emb3_LLE_1','city_emb3_LLE_2', 'city_emb3_LLE_3', 'city_emb2_tSNE_1','city_emb2_tSNE_2',
'city_emb1_Isomap_1', 'weather_emb1_Isomap_1'])
X = [X1, X2, X3, X4]
titles = ['No dimension reduction', 'tSNE on city', 'Isomap on city and weather', 'PCA on weather']
zipped = zip(titles, X)
dic = dict(zipped)
for key, value in dic.items():
    plot_pr(value, 8, title=key)
fig, ax = plt.subplots(figsize=(15, 6))
y = ml_df['Severity']
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=.80, random_state=33)
rfc = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1).fit(X_train, y_train)
print("1. confusion matrix:\n")
y_pred = rfc.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
shap_values = shap.TreeExplainer(rfc).shap_values(X_train)
shap.summary_plot(shap_values[1], X_train)
from IPython.display import Image
Image(filename='us-easterntz.jpg', width =500, height=300)
Overall, traffic on the roads was not clearly affected by the accidents, nor did we see speed as a cause of accidents. However, road conditions, such as nearby amenities, had a bigger impact on the traffic consequences of an accident. City municipalities can therefore take this into consideration and put strategies in place to deal with accidents that occur next to amenities, so that traffic is not heavily affected.
US road accidents are highly correlated with state population: more densely populated states generally have more accidents than sparser ones. Moreover, road accidents often occurred on clear or overcast days.
We have seen that most accidents occur at junctions and in the states of California and Ohio. Furthermore, most accidents occur during the rush-hour period, and during the weekend most accidents that occur are very severe.
Random Forest has been shown to be the best model for predicting severity, with an area of 0.603. Using SHAP, we were able to see that the most influential feature was the Eastern Timezone. In future, to get better results, we would balance the data using a method such as the Synthetic Minority Over-sampling Technique (SMOTE).
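The SMOTE idea can be sketched as follows. This is a minimal, numpy-only illustration of the technique (the function name `smote_sketch` and the toy data are ours, not from the project); in practice one would use the `SMOTE` class from the imbalanced-learn package.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating each chosen
    sample towards one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]          # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                           # interpolation factor in [0, 1)
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synth)

# Toy minority class: 20 samples with 3 features
X_minority = np.random.default_rng(1).normal(size=(20, 3))
X_new = smote_sketch(X_minority, n_new=10)
print(X_new.shape)  # (10, 3)
```

The synthetic points would be appended to the minority class before refitting, so the classifier no longer benefits from simply predicting the majority.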
[1] https://www.kaggle.com/sobhanmoosavi/us-accidents
[2] Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, 2019.
[3] Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.
[4] Brownlee, J. (2021). Failure of Classification Accuracy for Imbalanced Class Distributions.
[5] Sandy (2019). California Car Accidents: Statistics and Causes
[7] Seber, G and Lee, A. (2012). Linear Regression Analysis. New Jersey, United States: John Wiley & Sons.
My contribution was the following:
| Section | Task |
|---|---|
| Introduction | Added picture, half of the introduction text, the Table of Contents |
| Part 2 | ALL (except web scraping for state names) |
| Part 3.5 | ALL |
| Part 3.6 | ALL |
| Part 4 | ALL |
| Part 6 | half |
| Part 7 | half |
As this was a group project, time was also spent in coordinating with my team-members and combining our work.
The contribution was the following:
| Section | Task |
|---|---|
| Introduction | half of the introduction text |
| Part 2.2 | Web scraping for state names |
| Part 3.2 | ALL |
| Part 3.3 | ALL |
| Part 3.4 | ALL |
| Part 5 | ALL |
| Part 6 | half |
| Part 7 | half |
As this was a group project, time was also spent in coordinating with my team-members and combining our work.
The contribution was the following:
| Section | Task |
|---|---|
| Part 2.2 | Helped with web scraping |
| Part 3.1 (Objective 1-5) + Heatmap | ALL |